fix(cpu-ops): lazy transpose for Q8_0 packed tensors by michalharakal · Pull Request #736 · SKaiNET-developers/SKaiNET

michalharakal · 2026-06-15T11:41:00Z

Problem

DefaultCpuOps.transpose rewraps packed bytes with a flipped shape for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 falls through to the generic FP32 DenseTensorDataFactory path, which casts the Byte-backed buffer to Float and throws:

ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Float
  at DenseTensorDataFactory.init → DefaultCpuOpsBase.transpose → linearProject

This blocks keeping a Q8_0 matmul weight packed through linearProject (matmul(x, transpose(W))).
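
The failure mode can be reproduced outside SKaiNET. The sketch below is a minimal illustration, not the library's code: sumAsFloats is a hypothetical stand-in for a generic dense path that assumes every element is a Float, and it throws the same kind of ClassCastException when handed boxed Byte values.

```kotlin
// Hypothetical stand-in for a generic FP32 path. This is NOT SKaiNET's
// DenseTensorDataFactory; it only mimics an unchecked element cast.
fun sumAsFloats(elements: Array<Any>): Float {
    var total = 0f
    for (e in elements) {
        total += e as Float // throws ClassCastException when e is a boxed Byte
    }
    return total
}

// Returns true if reading the buffer as Float fails with a ClassCastException.
fun failsAsFloat(elements: Array<Any>): Boolean =
    try {
        sumAsFloats(elements)
        false
    } catch (ex: ClassCastException) {
        true
    }
```

Calling failsAsFloat(arrayOf<Any>(1.toByte(), 2.toByte())) reports the failure, while a Float-backed array passes through. A dedicated branch for the packed type avoids ever reaching the cast.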

Fix

Add the analogous case, is Q8_0TensorData -> Q8_0BlockTensorData(Shape(cols, rows), d.packedData) (one line plus an import). The packed bytes do not need reordering for the kernel's [out, in] block-major convention, so this is a metadata-only (lazy) transpose like the others.
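
As a rough illustration of what a metadata-only transpose looks like next to an eager one, here is a self-contained sketch. All types (Shape2D, TensorData, PackedQ8_0, DenseF32) are hypothetical simplifications invented for this example and are not SKaiNET's classes. Whether a flipped shape over unchanged bytes is valid depends on the consuming kernel's layout convention, which the PR asserts for SKaiNET's packed matmul kernels; the sketch models only the rewrap.

```kotlin
// Hypothetical, simplified types; the names are not SKaiNET's API.
data class Shape2D(val rows: Int, val cols: Int)

sealed interface TensorData {
    val shape: Shape2D
}

// Packed Q8_0 weight: the ByteArray holds quantized blocks and is never reordered here.
class PackedQ8_0(override val shape: Shape2D, val packedData: ByteArray) : TensorData

// Plain FP32 tensor stored in row-major order.
class DenseF32(override val shape: Shape2D, val values: FloatArray) : TensorData

fun transpose(t: TensorData): TensorData = when (t) {
    // Lazy case: reuse the same byte buffer and flip only the shape metadata.
    is PackedQ8_0 -> PackedQ8_0(Shape2D(t.shape.cols, t.shape.rows), t.packedData)
    // Eager case: physically move every element into a new buffer.
    is DenseF32 -> {
        val (r, c) = t.shape
        val out = FloatArray(r * c)
        for (i in 0 until r) {
            for (j in 0 until c) {
                out[j * r + i] = t.values[i * c + j]
            }
        }
        DenseF32(Shape2D(c, r), out)
    }
}
```

In this sketch the packed branch allocates no new buffer, so its cost is independent of tensor size, and the sealed interface makes the when exhaustive so that a new packed type cannot silently fall through to a generic path.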

Why it matters

Unblocks keeping FunctionGemma's tied Q8_0 lm_head packed in the eager NATIVE_OPTIMIZED path instead of dequantizing it to FP32 (~0.67 GB), which runs the 1.9 GB Astra Machina SL2610 out of memory.
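
To put the memory figure in context, the following sketch compares FP32 and Q8_0 storage for one weight matrix. It relies on the GGML Q8_0 layout (blocks of 32 weights, each stored as a 2-byte fp16 scale plus 32 int8 values, 34 bytes in total). The 262,144 x 640 dimensions used in the usage note are an assumption chosen because they reproduce the ~0.67 GB figure; the PR does not state the lm_head shape.

```kotlin
// GGML Q8_0: each block covers 32 weights and stores a 2-byte fp16 scale plus 32 int8 values.
const val Q8_0_BLOCK_WEIGHTS = 32
const val Q8_0_BLOCK_BYTES = 2 + 32

// Bytes needed to hold the matrix as dense FP32 (4 bytes per weight).
fun fp32Bytes(rows: Long, cols: Long): Long = rows * cols * 4

// Bytes needed to hold the same matrix as packed Q8_0 blocks.
fun q8_0Bytes(rows: Long, cols: Long): Long {
    require(cols % Q8_0_BLOCK_WEIGHTS == 0L) { "row length must be a multiple of 32" }
    return rows * (cols / Q8_0_BLOCK_WEIGHTS) * Q8_0_BLOCK_BYTES
}
```

Under the assumed shape, fp32Bytes(262_144, 640) gives 671,088,640 bytes (about 0.67 GB) and q8_0Bytes(262_144, 640) gives 178,257,920 bytes (about 0.18 GB), so staying packed would avoid roughly 0.49 GB of extra allocation on top of the packed weight.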

Verification

SKaiNET-transformers GemmaQ5KPackedParityTest (composite -PuseLocalSkainet=true) now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline. See SKaiNET-transformers #178.

🤖 Generated with Claude Code

ops.transpose rewraps the packed bytes with a flipped shape for the K-series (Q4_K/Q5_K/Q6_K) and Q5_0/Q5_1, but Q8_0 fell through to the generic FP32 DenseTensorDataFactory path, which casts the Byte-backed buffer to Float and throws ClassCastException. Add the analogous Q8_0BlockTensorData case.

This unblocks keeping a Q8_0 matmul weight packed through linearProject (matmul(x, transpose(W))), notably FunctionGemma's tied Q8_0 lm_head, which otherwise has to dequant to FP32 (~0.67 GB) and OOMs the 1.9 GB SL2610 board.

Verified: SKaiNET-transformers GemmaQ5KPackedParityTest (eager load(NATIVE_OPTIMIZED)) now packs the lm_head as Q8_0 and decodes byte-identically to the FP32 baseline. See SKaiNET-transformers#178.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

michalharakal merged commit cd2bfd2 into develop Jun 15, 2026
6 checks passed

michalharakal deleted the fix/q8_0-lazy-transpose branch June 15, 2026 11:54

This was referenced Jun 15, 2026

fix(cpu-ops): complete lazy transpose for all packed matmul dtypes (Q4_0) #737

Merged

chore(release): prepare SKaiNET 0.31.0 #738

Merged
